1  Introduction


Parallel  computing  is  an  enabling  technology  for  many  areas  of  science  and  engineering.  Parallel 
computing  is  also  our  best  investment  to  continue  achieving  performance  gains  as  the  limits 
of  semiconductor  technology  are  approached.  Whereas  the  design  of  parallel  computers  is 
reasonably  well-understood,  the  programming  of  these  machines  remains  in  its  infancy.  As  a 
consequence,  programming  parallel  computers  is  substantially  more  difficult  than  programming 
conventional  uniprocessors.  In  fact,  it  is  commonly  said  that  the  development  of  effective 
parallel  programming  techniques  is  the  main  challenge  faced  by  high-performance  computing 
today.

Unless  this  challenge  is  met,  the  acceptance  of  parallel  computers  will  remain  slow,  thereby 
hampering  progress  in  computational  science  and  engineering.  An  appealing  strategy  to  meet 
this  challenge  is  to  use  compilers  to  translate  conventional  programs  into  parallel  form.  Such 
compilers  would  enable  the  execution  of  existing  programs  on  new  parallel  machines,  thus 
allowing  a  seamless  transition  into  parallel  computing.  With  the  support  of  such  compilers, 
not only could legacy codes be easily ported, but also new programs could be developed
in  the  familiar  sequential  paradigm,  thus  liberating  the  programmer  from  the  complexities  of 
explicit,  machine-oriented  parallel  programming. 

This is the final report of the ARPA project entitled A New-Generation Parallelizing Compiler
System, whose objectives were to advance the state of the art in automatic program
parallelization and to demonstrate that, with advanced parallelization techniques, real conventional
Fortran  programs  can  be  automatically  translated  into  effective  parallel  codes.  As  can  be  seen 
from  the  discussion  below,  we  have  succeeded  on  both  objectives.  In  fact,  the  project  reported 
here  was  the  first  compiler  research  project  in  more  than  a  decade  to  show  substantial  progress 
in  automatic  parallelization. 

As  part  of  this  project,  we  built  Polaris,  an  experimental  translator  of  conventional  Fortran 
programs  which  targets  shared-memory  multiprocessors  such  as  the  SGI  Challenge  and  the 
SUN  Enterprise  server,  as  well  as  scalable  machines  with  a  global  address  space,  such  as  the 
Cray  T3D.  In  Section  2,  we  briefly  describe  the  internal  organization  of  Polaris.  In  Section 
3,  we  enumerate  the  techniques  applied  by  Polaris  and  elaborate  on  how  they  were  chosen 
for  implementation.  In  Section  4,  we  present  an  evaluation  of  the  effectiveness  of  Polaris,  and 
Section  5  covers  ongoing  work  as  well  as  possible  future  directions  of  the  Polaris  project.  Finally, 
in  Section  6  we  discuss  the  educational  impact  of  the  project  as  well  as  its  impact  on  other 
research  groups  and  on  industry.  More  extensive  surveys  of  the  results  of  this  project  can  be 
found in [7], [8], [17], and [1]. The references at the end of this document are the publications
of  the  project  including  theses  and  technical  reports. 

2  The  Polaris  System 

Polaris  was  developed  as  a  vehicle  to  experiment  with  program  analysis  and  restructuring 
techniques  and  to  help  in  their  evaluation  by  generating  executable  code  for  parallel  computers 
with  a  global  address  space.  We  selected  these  machines  as  targets  of  this  project,  rather  than 
machines  requiring  message  passing,  to  avoid  the  complexities  of  message  generation.  In  this 
way  we  were  able  to  focus  all  our  energies  on  improving  the  accuracy  of  the  analysis  techniques 
to  detect  parallelism  and  on  transformation  techniques  to  enable  the  exploitation  of  parallelism. 
In retrospect, this was clearly the right choice, not only because we avoided the distraction of
generating messages but also because, as we expected, the global address space has become the
dominant machine organization. In fact, industry is now rapidly converging towards the global
address space model.

Polaris presently accepts Fortran programs and generates parallel Fortran code for SGI
and Sun multiprocessors using vendor-specific directives. Polaris also generates parallel code for
the  GUIDE  compiler  from  Kuck  and  Associates  Inc.,  a  portable  compiler  with  implementations 
for  most  shared-memory  multiprocessor  machines.  We  are  now  working  on  code  generation 
modules  for  the  Convex  Exemplar  and  the  Cray  T3D. 

We started developing Polaris in March 1993. The system was developed in C++ and
presently contains about 200,000 lines of code. In the design of the internal organization of
Polaris,  our  main  objective  was  to  develop  a  system  that  would  facilitate  the  development  and 
debugging  of  the  analysis  and  transformation  routines.  We  chose  to  build  the  system  around  a 
class  hierarchy  that  represents  a  Fortran  program  internally  to  the  translator.  The  methods  in 
the hierarchy were designed in such a way that the Fortran program is always well-formed and
the associated information, such as the symbol table and the control flow graph, is automatically
kept  consistent  with  the  program  being  transformed. 
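
The idea of methods that keep derived information consistent can be sketched with a toy example. The sketch below is only an illustration under stated assumptions: the class `ProgramUnit`, its methods, and the representation of a statement as a target plus operands are all hypothetical and are not taken from the Polaris class hierarchy, which is written in C++ and described in [12] and [13].

```python
class ProgramUnit:
    """Toy program representation: a statement list plus a symbol table kept in sync.

    Hypothetical illustration only; not the Polaris class hierarchy.
    """

    def __init__(self):
        self._statements = []   # each statement is (target, operands)
        self._uses = {}         # symbol name -> number of statements mentioning it

    @staticmethod
    def _names(stmt):
        target, operands = stmt
        return {target, *operands}

    def insert(self, index, target, operands):
        """Insert a statement and update the symbol table in the same operation."""
        stmt = (target, tuple(operands))
        self._statements.insert(index, stmt)
        for name in self._names(stmt):
            self._uses[name] = self._uses.get(name, 0) + 1

    def delete(self, index):
        """Delete a statement and drop symbols that are no longer mentioned."""
        stmt = self._statements.pop(index)
        for name in self._names(stmt):
            self._uses[name] -= 1
            if self._uses[name] == 0:
                del self._uses[name]

    def symbols(self):
        return set(self._uses)
```

Because the statement list can only be changed through `insert` and `delete`, a transformation written against this interface cannot leave the symbol table out of date, which is the property the paragraph above describes.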

The class hierarchy also includes a number of high-level operations useful to manipulate
Fortran programs such as the operations to construct and manipulate sets. A detailed description
of the class hierarchy can be found in [12] and [13]. As part of the internal organization, we
have also designed a language extension to C++ that we call FORBOL [36]. This is a collection
of high-level operations for pattern matching and replacement that allows a very compact
expression of a class of frequently applied pattern-based transformations.

The  internal  organization  of  Polaris  has  proven  quite  effective.  Thanks  to  the  object-oriented 
approach  we  followed,  the  number  of  errors  associated  with  bookkeeping  operations  decreased 
substantially relative to the number of errors in a more conventional, non-object-oriented
approach.  Furthermore,  the  implementation  of  transformations  was  simplified  dramatically  by 
the  availability  of  an  extensive  collection  of  powerful  high-level  operations.  As  a  consequence, 
we  were  able  to  complete  the  implementation  of  the  system  in  a  relatively  short  time.  In  fact, 
the  first  version  of  Polaris  was  completed  in  only  two  years  with  the  participation  of  only  six 
half-time  students  and  two  software  engineers. 

3  Techniques  Implemented  in  Polaris 

The principal techniques implemented in Polaris were identified in a study
of  the  effectiveness  of  commercial  Fortran  parallelizers  [11]  that  preceded  this  project.  In  that 
study,  the  Perfect  Benchmarks,  a  collection  of  conventional  Fortran  programs  representing  the 
typical  workload  of  high-performance  computers,  were  compiled  for  the  Alliant  FX/80,  an 
eight-processor  multiprocessor  popular  in  the  late  1980s.  For  each  program,  we  computed  the 
speedup  as  a  measure  of  the  quality  of  the  parallelization.  Although  the  speedups  obtained 
were  satisfactory  in  a  few  cases,  the  Alliant  Fortran  compiler  failed  to  deliver  any  significant 
speedup  for  the  majority  of  the  programs. 

The  main  reason  for  this  failure  was  the  inability  of  the  Alliant  compiler  to  analyze  and 
transform  into  parallel  form  some  of  the  most  time-consuming  multiply-nested  loops  in  the 
Perfect  Benchmarks.  In  retrospect,  this  was  not  surprising  because  the  parallelization  module 
of the Alliant compiler was originally developed for vectorization. Commercial multiprocessors,
such  as  the  Alliant  FX/80,  were  relatively  new  in  the  late  1980s,  and  the  parallelization  modules 
of  their  compilers  were  usually  based  on  translators  for  vector  machines.  Vectorizers,  in  their 
quest  for  isolating  loop  kernels  and  transforming  them  into  vector  operations,  focus  primarily 
on  innermost  loops.  Compilers  for  multiprocessors,  in  contrast,  must  focus  on  parallelizing 
outermost  loops  in  order  to  reduce  the  effect  of  the  overhead  associated  with  parallel  execution. 
Furthermore,  with  outer  loop  parallelization  it  is  possible  to  take  advantage  of  the  true  power 
of  multiprocessing  and  in  this  way  increase  the  fraction  of  code  executed  in  parallel. 
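
The overhead argument can be made concrete with a simple fork-join cost model. The sketch below is illustrative only: the function name, its parameters, and the model itself (a fixed startup overhead per parallel loop instance plus perfectly divided work) are assumptions, not measurements from this project.

```python
def parallel_time(n_outer, n_inner, work_per_iter, overhead, procs, level):
    """Estimated run time of a doubly nested loop under a simple fork-join model.

    level == "outer": the outer loop is parallel, so one parallel loop is started.
    level == "inner": the inner loop is parallel, so it is started n_outer times.
    """
    total_work = n_outer * n_inner * work_per_iter
    if level == "outer":
        startups = 1
    elif level == "inner":
        startups = n_outer
    else:
        raise ValueError("level must be 'outer' or 'inner'")
    return startups * overhead + total_work / procs


# Assumed numbers: 100 x 100 iterations of unit cost, overhead 50, 8 processors.
outer = parallel_time(100, 100, 1.0, 50.0, 8, "outer")   # 1 * 50 + 10000 / 8
inner = parallel_time(100, 100, 1.0, 50.0, 8, "inner")   # 100 * 50 + 10000 / 8
```

Under these assumed numbers the same work costs 1300 time units when the outer loop is parallel and 6250 when only the inner loop is, because the startup overhead is paid once rather than once per outer iteration.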

Our  study  showed  that  extending  the  four  most  important  analysis  and  transformation 
techniques  traditionally  used  for  vectorization  -  dependence  analysis,  privatization,  induction 
variable substitution, and reduction substitution - led to significant increases in speedup. The
bulk of our energies went to the development and implementation of enhanced versions of
these four techniques plus two others needed for interprocedural analysis: Inlining and
Interprocedural Value Propagation.
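
Two of these techniques, privatization and reduction substitution, can be illustrated on a summation loop. The sketch below is a minimal illustration written in Python rather than Fortran. The function names and the cyclic distribution of iterations are assumptions chosen for the example, and the loop over `p` merely stands in for processors that would run concurrently.

```python
def sum_sequential(values):
    """Original loop: every iteration updates the same variable, which serializes it."""
    total = 0
    for v in values:
        total += v
    return total


def sum_by_reduction(values, procs):
    """Transformed loop: each processor accumulates into its own private copy.

    The iterations of the loop over p are independent of one another, so they
    could execute in parallel; the private partial sums are combined afterwards.
    """
    partial = [0] * procs                 # one private accumulator per processor
    for p in range(procs):                # conceptually a parallel loop
        for v in values[p::procs]:        # cyclic share of the iterations
            partial[p] += v
    return sum(partial)                   # final combining step
```

With integer data, `sum_by_reduction(list(range(1, 101)), 8)` gives the same result as `sum_sequential(list(range(1, 101)))`. With floating-point data the reordered additions may round differently, which is one reason a compiler has to recognize the reduction pattern explicitly instead of treating it as an ordinary dependence.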

In  the  area  of  dependence  analysis,  we  developed  a  new  technique,  the  Range  Test,  capable 
of  dealing  with  non-affine  subscript  expressions.  The  implementation  of  this  test  was  supported 
by powerful computer algebra algorithms and propagation techniques [3, 6, 2, 4, 9, 5]. For privatization,
we developed a new strategy based on flow analysis and also supported by powerful
computer  algebra  and  attribute  propagation  algorithms  [33,  35,  34,  32].  New  techniques  were 
also  developed  for  induction  and  reduction  substitution  [15,  19,  20,  21]. 
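
A small example shows how induction variable substitution can produce a non-affine subscript and why comparing the ranges of subscripts helps. Consider a triangular loop nest in which a counter k is incremented once per inner iteration and used as a subscript. The sketch below is an assumption-laden illustration: the loop nest, the function names, and the numerical check are ours, and the check only mimics, for concrete values of n, a comparison that a compiler would have to carry out symbolically.

```python
def subscripts_sequential(n):
    """Subscripts generated by the original nest: k = k + 1 in every inner iteration."""
    k = 0
    out = {}
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            k += 1
            out[(i, j)] = k
    return out


def subscript_closed_form(i, j):
    """Closed form after induction variable substitution; quadratic (non-affine) in i."""
    return i * (i - 1) // 2 + j


def outer_ranges_disjoint(n):
    """Check that the subscripts used by outer iteration i all lie below those of i + 1."""
    for i in range(1, n):
        hi_this = subscript_closed_form(i, i)        # largest subscript in iteration i
        lo_next = subscript_closed_form(i + 1, 1)    # smallest subscript in iteration i + 1
        if hi_this >= lo_next:
            return False
    return True
```

Since the largest subscript of iteration i is i(i+1)/2 and the smallest of iteration i+1 is i(i+1)/2 + 1, the ranges never overlap, so the outer loop carries no dependence through this array even though the subscript is not an affine function of the loop indices.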

We  also  implemented  a  very  powerful  and  robust  inlining  module  to  help,  albeit  in  a  limited 
way,  with  interprocedural  analysis  [14].  Polaris  inlines  automatically  all  subroutines  which  meet 
a  set  of  criteria.  We  call  this  last  strategy  Autoinlining.  The  other  technique  for  interprocedural 
analysis applied by Polaris is the Interprocedural Value Propagation (IPVP) of integer variables.
IPVP clones subroutines when different call sites supply different values to the parameters.
As  can  be  seen  below,  autoinlining  and  IPVP,  although  not  as  powerful  as  a  comprehensive 
interprocedural  analysis  algorithm,  improved  the  accuracy  of  program  analysis  in  several  cases. 
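
The cloning step of IPVP can be sketched as a grouping of call sites by the constant values they pass. The sketch below is hypothetical: the function `plan_clones`, the representation of a call site as a name plus a tuple of argument values, and the use of `None` for an unknown value are assumptions made for illustration, not the Polaris implementation.

```python
from collections import defaultdict


def plan_clones(call_sites):
    """Group the call sites of one subroutine by the integer constants they pass.

    call_sites: list of (site_name, args), where args is a tuple holding an int
    for each argument whose value is known and None for each unknown argument.
    Returns a dict mapping each distinct argument tuple to the sites that use it;
    one clone of the subroutine would be created per key.
    """
    groups = defaultdict(list)
    for site, args in call_sites:
        groups[args].append(site)
    return dict(groups)


sites = [("call_1", (4, None)), ("call_2", (8, None)), ("call_3", (4, None))]
clones = plan_clones(sites)
```

Here two clones would be planned, one specialized for a first argument of 4 and one for 8. Inside each clone the parameter can be treated as a known constant, which is what allows later analysis passes to be more precise.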

4  Effectiveness  of  Polaris  Techniques 

Figure 1 presents a speedup comparison between Polaris and SGI's parallelizer PFA for sixteen
benchmark programs, from three different sources:

•  arc2d, bdna, flo52, mdg, ocean, and trfd from the Perfect Benchmarks suite;

•  applu, appsp, hydro2d, su2cor, swim, tfft2, tomcatv, and wave from the SPEC95 benchmark suite; and

•  cmhog and cloud3d from NCSA.

The  first  two  suites  are  well-known.  The  third  source  is  a  collection  of  programs  currently 
being  used  in  research  at  the  National  Center  for  Supercomputing  Applications  (NCSA).  These 
programs  are  of  moderate  size,  containing  between  10,000  and  20,000  lines  each. 

The programs were executed (in real-time mode for timing accuracy) on an eight-processor
SGI  Challenge  with  150  MHz  R4400  processors.  Figure  1  shows  that  Polaris  delivers,  in  many 
cases,  substantially  better  speedups  than  PFA. 

Figure 1: Speedup Comparison Between the SGI PFA Compiler and Polaris (benchmark groups: Perfect, SPEC, NCSA)

In three of the sixteen programs, PFA produces better speedups than Polaris. The reason
is  that  PFA  uses  an  elaborate  code  generation  strategy  that  includes  loop  transformations, 
such  as  loop  interchanging,  unrolling,  and  fusion,  which,  when  applied  to  the  right  loops, 
improve  performance  by  decreasing  overhead,  enhancing  locality,  and  facilitating  the  detection 
of instruction-level parallelism. However, this elaborate strategy has a negative effect on the
program tomcatv: PFA detects as much parallelism as Polaris does, but the
Polaris-transformed program runs faster.

To  evaluate  the  effectiveness  of  the  six  techniques  implemented  in  Polaris,  we  transformed 
each  program  six  times,  with  one  technique  either  turned  off  or  weakened  in  each,  and  then 
compared  those  with  the  program  compiled  with  all  techniques  at  full  strength.  In  the  case 
of Autoinlining and Interprocedural Value Propagation, the techniques were turned off completely.
In the case of Privatization, Dependence Analysis, Induction Variable Recognition,
and  Reduction  Recognition,  only  the  advanced  components  of  the  techniques  were  turned  off. 
For  the  Dependence  Analysis  measurement,  the  Equality  and  GCD  Tests  were  applied,  while 
the  Range  Test  was  not.  For  the  Reduction  Recognition  measurements,  only  scalar  reduction 
variables were considered. For Privatization, only scalar privatization was applied. For the
Induction Recognition measurement, only induction variables with linear closed forms were taken
into  account. 
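
For readers unfamiliar with the baseline tests, the GCD Test mentioned above can be sketched as follows. This is the standard textbook formulation, not code from Polaris, and the function name and argument convention are assumptions. For a write to x(a*i + c1) and a read of x(b*j + c2), a dependence requires an integer solution of a*i - b*j = c2 - c1, which exists only if gcd(a, b) divides c2 - c1.

```python
from math import gcd


def gcd_test_may_depend(a, b, c):
    """GCD test for the equation a*i - b*j = c over the integers.

    Returns False when no integer solution exists (the references are independent)
    and True when a solution may exist. Loop bounds are ignored, so True is only
    a conservative answer.
    """
    g = gcd(a, b)
    if g == 0:                 # both coefficients are zero: the subscripts are constants
        return c == 0
    return c % g == 0


# x(2*i) versus x(2*j + 1): even and odd elements never coincide.
even_odd = gcd_test_may_depend(2, 2, 1)
# x(2*i) versus x(4*j + 2): a common element may exist.
overlap = gcd_test_may_depend(2, 4, 2)
```

The test applies only to affine subscripts with constant coefficients, which is why subscripts that are non-affine in the loop indices fall to the Range Test in the full-strength configuration.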

Figure  2  presents  the  results  of  the  six  experiments  conducted  for  each  code.  The  height  of 
the  bar  at  (P,  T)  represents,  in  logarithmic  scale,  the  percentage  of  the  total  number  of  loops 
in  program  P  which  become  serialized  by  disabling  technique  T.  From  Figure  2,  it  can  be  seen 
that  all  techniques  enable  parallelization  of  loops  from  many  programs  in  the  collection.  This 
shows  that  although  the  inspiration  for  the  techniques  came  from  analyzing  programs  in  only 
the  Perfect  Benchmarks,  they  are  also  effective  on  programs  outside  the  Perfect  Benchmarks. 


5  The  Future:  Detecting  the  Remaining  Parallelism  and  Improving  Performance

Polaris  has  detected  much,  but  not  all,  of  the  parallelism  available  in  our  set  of  benchmark 
codes.  A  careful  analysis  of  the  loops  in  our  benchmark  codes  which  Polaris  could  parallelize, 
but  does  not,  shows  several  areas  where  improvement  is  possible. 

First,  we  need  to  extend  all  of  our  analysis  interprocedurally.  Although  IPVP  has  proven 
quite  useful,  a  true  interprocedural  framework  for  all  analysis  is  still  needed.  However,  due 
to the large amount of information required by our analysis algorithms, a traditional global
propagation algorithm would be too inefficient. We are, therefore, focusing on a demand-driven
strategy  that  would  allow  high  accuracy  without  significantly  increasing  the  analysis  time. 

Second,  improving  our  analysis  techniques  for  dependence  and  privatization  is  important. 
When  compile  time  analysis  fails,  analysis  code  needs  to  be  generated  for  use  at  runtime.  This 
will  allow  Polaris  to  parallelize  even  loops  with  access  patterns  determined  by  input  values. 
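
The kind of check that generated runtime code would perform can be sketched for a loop of the form a(w(i)) = f(a(r(i))), where the index arrays w and r are read from input. The sketch below is a simplified inspector-style illustration under our own assumptions. It is not the runtime test developed in this project, and the function name and loop form are hypothetical.

```python
def loop_is_independent(write_idx, read_idx):
    """Inspect the subscripts of a loop  a[write_idx[i]] = f(a[read_idx[i]]).

    Returns True when no element is written by two iterations and no element
    is read in one iteration and written in a different one, so that the
    iterations could safely execute in parallel.
    """
    writer = {}                               # element -> iteration that writes it
    for i, w in enumerate(write_idx):
        if w in writer:
            return False                      # two iterations write the same element
        writer[w] = i
    for i, r in enumerate(read_idx):
        if r in writer and writer[r] != i:
            return False                      # read and write in different iterations
    return True
```

For instance, `loop_is_independent([3, 0, 2], [3, 0, 2])` is True because each iteration touches only its own element, whereas `loop_is_independent([1, 2, 3], [0, 1, 2])` is False because iteration 1 reads the element written by iteration 0. The cost of such a check is linear in the number of iterations, so it pays off only when the loop body is substantial or the check can be reused.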

In  fact,  although  not  used  to  obtain  the  results  described  in  the  preceding  section,  we  have 
also  developed  and  implemented  a  novel  runtime  dependence  and  privatization  test  [26,  28, 
31,  30,  27,  25,  23,  29,  24,  22,  16]  to  enable  the  parallelization  of  those  codes  whose  memory 
access  pattern  is  not  known  at  compile  time.  Such  situations  arise,  for  example,  in  sparse 
computations. Preliminary results look quite promising. We plan to conduct more extensive
experiments in the near future.

Figure 2: Percentage of Loops Serialized when Disabling Certain Techniques

Third, Polaris needs to take into account additional program patterns, such as more complicated
forms for induction and reduction variables, associative recurrences, multiple exit loops,
and  loops  containing  I/O  statements.  We  have  identified  many  important  program  patterns  by 
analyzing  many  codes  by  hand.  We  plan  to  continue  these  hand  studies  and  extend  them  to 
include  collections  other  than  the  three  mentioned  above. 

Fourth,  we  have  to  develop  machine-specific  optimizations  to  deal  with  important  issues 
such  as  locality  and  communication.  We  are  also  developing  machine-specific  optimizations 
for  shared-memory  workstations  [10].  Also,  we  have  recently  completed  the  first  version  of  a 
code  generator  for  the  Cray  T3D  [18]  and  are  now  working  on  a  similar  module  for  the  Convex 
Exemplar.  The  preliminary  results  on  the  Cray  T3D  are  quite  promising.  These  results  indicate 
that  in  many  cases  it  is  possible  to  translate  conventional  code  quite  effectively  for  scalable 
machines. 

We  also  must  improve  the  efficiency  of  Polaris  in  order  to  allow  us  to  compile  very  large 
Fortran  programs.  Although  Polaris  can  routinely  compile  programs  with  10,000  lines,  we  hope 
to  eventually  be  able  to  compile  programs  with  100,000  to  1,000,000  lines. 

6  Impact  of  the  Polaris  Project 

The  Polaris  project  has  had  a  wide  impact  on  both  academia  and  industry.  Group  members  have 
presented  papers  at  a  wide  variety  of  conferences  and  workshops,  winning  a  Best  Paper  Award 
at the ACM International Conference on Supercomputing in 1995. The group bibliography
contains  more  than  30  publications,  including  theses. 

The group has produced seven graduates: three PhDs and four Masters. Some of these students
are  now  employed  by  top  computer  companies  (Intel,  Hewlett-Packard,  Silicon  Graphics,  and 
Microsoft),  and  one  is  a  faculty  member  at  Texas  A&M  University. 

Polaris  technology  has  generated  much  interest  from  several  industrial  compiler  groups.  We 
have  maintained  close  working  ties  with  Kuck  and  Associates,  and  Polaris  technology  is  being 
integrated  into  their  KAP  parallelizer.  Interest  within  IBM  has  culminated  in  a  partnership 
award that will allow us to target Polaris at the new COMA multiprocessor under development
at the T.J. Watson Research Center. Interest at SGI, Convex, and Sun has generated much
interaction  between  their  compiler  groups  and  ourselves. 

The  Polaris  parallelizer  has  been  well-accepted  around  the  world.  We  now  have  24  licensed 
Polaris  users.  Twelve  are  from  the  US.  The  other  twelve  are  from  Canada,  Europe,  Japan, 
Korea,  and  Australia.  Eleven  users  have  been  gained  in  the  last  six  months. 

These  users  are  using  Polaris  in  a  variety  of  ways.  At  Cornell  University,  Polaris  has  served 
as  the  core  parallelizer  in  a  compiler  course,  upon  which  students  implement  compiler  projects. 
Polaris  was  also  used  in  a  graduate  course  at  Toronto  and  will  be  used  next  semester  in  a 
similar  course  at  Texas  A&M  University.  At  the  University  of  Malaga  in  Spain,  researchers 
have  been  testing  the  usefulness  of  Polaris  for  parallelizing  sparse  codes.  In  Barcelona,  Spain, 
researchers  are  analyzing  codes  for  instruction-level  parallelism  with  Polaris,  using  the  powerful 
analytical  capabilities  of  Polaris  to  inform  the  code  generation  module,  which  they  have  written. 
The  University  of  Toronto  is  using  Polaris  as  the  basis  for  their  NUMAchine  parallelizer.  At 
Purdue,  they  are  targeting  Polaris  at  a  Sun  multiprocessor  workstation.  At  the  University  of 
Rochester,  Polaris  is  being  used  as  an  integral  part  of  a  data  access  visualization  environment 
(DAVE). 

Several  of  the  most  active  users  will  join  us  in  a  POLARIS  exhibit  at  Supercomputing  ’96  in 
November.  We  will  give  on-line  demonstrations  of  the  variety  of  uses  to  which  Polaris  is  being 
put. 

7  Conclusions 

The  techniques  implemented  in  the  first  version  of  Polaris  have  proven  quite  effective  on  a 
variety  of  programs.  This  verifies  the  results  of  our  study  involving  the  Alliant  machine.  Even 
so,  we  find  that  further  progress  is  necessary.  A  wide  range  of  techniques  is  necessary  to  detect 
as  much  parallelism  as  is  present  in  real  programs. 

Perhaps  the  most  important  achievement  of  our  work  in  the  Polaris  project  has  been  the 
demonstration  that  substantial  progress  in  compiling  conventional  languages  is  possible.  There 
have  been,  and  still  are,  many  who  do  not  believe  it  is  possible  to  develop  compilers  capable  of 
generating  effective  parallel  code  for  a  wide  range  of  real  programs.  We  disagree,  and  believe 
that,  at  least  for  Fortran  and  other  similar  languages  such  as  MATLAB,  effective  parallelizers 
will  be  available  within  the  next  decade.  Our  experimental  results  and  careful  hand  analysis  of 
real  codes  strongly  support  this  opinion. 